feat(server): reduce layer-split activation memory with backend precision policy by weicj · Pull Request #310 · Luce-Org/lucebox-hub

weicj · 2026-05-29T17:30:09Z

Summary

This PR reduces long-context target layer-split OOM risk by storing layer-split activation staging buffers in a backend-appropriate dtype instead of always using F32. The graph still casts back to F32 at ggml RMSNorm / norm-weight boundaries, so the lower-precision storage does not change those operator contracts.
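To illustrate why storing in a narrower dtype and casting back to F32 at the norm boundary keeps the operator contract, here is a minimal sketch of a BF16 round trip. The helper names `f32_to_bf16` and `bf16_to_f32` are hypothetical and are not taken from this PR or from ggml. The sketch assumes IEEE-754 binary32 floats and round-to-nearest-even on the narrowing step; the widening step is exact.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper (not the PR's code): narrow F32 to BF16 storage
// using round-to-nearest-even on the 16 discarded mantissa bits.
inline std::uint16_t f32_to_bf16(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    // Keep NaN a NaN: force a mantissa bit so rounding cannot carry it away.
    if ((bits & 0x7FFFFFFFu) > 0x7F800000u) {
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Adding 0x7FFF plus the lowest kept bit rounds ties to even.
    bits += 0x7FFFu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical helper: widen BF16 storage back to F32. This is exact,
// because BF16 is the upper 16 bits of the F32 encoding.
inline float bf16_to_f32(std::uint16_t stored) {
    const std::uint32_t bits = static_cast<std::uint32_t>(stored) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

The precision loss happens once, when the activation is stored. Everything downstream of the cast sees an ordinary F32 tensor, at half the storage cost per element.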

The allocation reduced here is hidden_size * context_tokens * bytes_per_element * staging_buffer_count. Qwen35 and Laguna currently use two staging buffers; Gemma4 uses three because its per-layer input path also keeps an original embedding staging buffer.

At a 256K context cap, the staging allocation changes from:

Qwen35 / Qwen3.6-27B / hidden=5120: 10240 MiB -> 5120 MiB.
Laguna-XS.2 / hidden=2048: 4096 MiB -> 2048 MiB.
Gemma4 31B / hidden=5376: 16128 MiB -> 8064 MiB.
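
The figures above can be reproduced from the sizing formula. The sketch below assumes that a 256K cap means 262144 tokens, that F32 is 4 bytes per element, and that F16/BF16 are 2 bytes per element. The function name `staging_bytes` is hypothetical and is used only for this check.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;
constexpr std::uint64_t kContext256K = 256ull * 1024ull;  // assumption: 256K = 262144 tokens

// hidden_size * context_tokens * bytes_per_element * staging_buffer_count
constexpr std::uint64_t staging_bytes(std::uint64_t hidden_size,
                                      std::uint64_t context_tokens,
                                      std::uint64_t bytes_per_element,
                                      std::uint64_t staging_buffer_count) {
    return hidden_size * context_tokens * bytes_per_element * staging_buffer_count;
}

// Qwen35 / Qwen3.6-27B: hidden=5120, two staging buffers.
static_assert(staging_bytes(5120, kContext256K, 4, 2) / kMiB == 10240, "Qwen35 F32");
static_assert(staging_bytes(5120, kContext256K, 2, 2) / kMiB == 5120, "Qwen35 F16/BF16");

// Laguna-XS.2: hidden=2048, two staging buffers.
static_assert(staging_bytes(2048, kContext256K, 4, 2) / kMiB == 4096, "Laguna F32");
static_assert(staging_bytes(2048, kContext256K, 2, 2) / kMiB == 2048, "Laguna F16/BF16");

// Gemma4 31B: hidden=5376, three staging buffers.
static_assert(staging_bytes(5376, kContext256K, 4, 3) / kMiB == 16128, "Gemma4 F32");
static_assert(staging_bytes(5376, kContext256K, 2, 3) / kMiB == 8064, "Gemma4 F16/BF16");
```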

Changes

Add BackendActivationPolicy in common/backend_precision for target layer-split activation staging.
Select activation dtype from backend architecture:
- CUDA Ampere+ and HIP native-BF16 targets -> BF16.
- CUDA tensor-core / GP100 and HIP gfx9/gfx10 targets without native BF16 -> F16.
- Older or unknown CUDA/HIP targets -> F32.
Add LUCEBOX_LAYER_SPLIT_ACT_TYPE=f32|f16|bf16 as an explicit local override.
Extend shared layer-split activation buffers to allocate F32, F16, or BF16 tensors and upload F32 embeddings into the selected storage dtype.
Add common/ggml_graph_precision.h for graph-side F32 casts.
Make Qwen35, Gemma4, and Laguna RMSNorm / norm-weight graph boundaries F32-safe.
Wire Qwen35, Gemma4, and Laguna target layer-split adapters through the shared activation policy.
Add no-GPU unit coverage for the CUDA SM and HIP gfx dtype-selection tables.
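
As a rough illustration of the selection rules and the override listed above, the sketch below models only the CUDA side. It is not the PR's BackendActivationPolicy. The names `ActDType`, `select_cuda_act_dtype`, `parse_act_dtype_override`, and `resolve_cuda_act_dtype` are hypothetical. The compute-capability encoding (major * 10 + minor), the exact SM thresholds, and the fallback on an unrecognized override value are all assumptions. The HIP table is omitted because the description does not list which gfx targets count as native BF16.

```cpp
#include <cstring>

enum class ActDType { F32, F16, BF16 };

// Assumed CUDA table: sm is compute capability as major * 10 + minor
// (for example 86 for SM 8.6). Thresholds are illustrative.
inline ActDType select_cuda_act_dtype(int sm) {
    if (sm >= 80) return ActDType::BF16;             // Ampere and newer
    if (sm >= 70 || sm == 60) return ActDType::F16;  // tensor-core parts, plus GP100
    return ActDType::F32;                            // older or unknown targets
}

// Parse an f32|f16|bf16 override string. A null or unrecognized value
// falls back to the backend default (an assumption about the real behavior).
inline ActDType parse_act_dtype_override(const char* value, ActDType fallback) {
    if (value == nullptr) return fallback;
    if (std::strcmp(value, "f32") == 0) return ActDType::F32;
    if (std::strcmp(value, "f16") == 0) return ActDType::F16;
    if (std::strcmp(value, "bf16") == 0) return ActDType::BF16;
    return fallback;
}

// The override, when present and valid, wins over the architecture default.
inline ActDType resolve_cuda_act_dtype(int sm, const char* override_value) {
    return parse_act_dtype_override(override_value, select_cuda_act_dtype(sm));
}
```

A caller could pass `std::getenv("LUCEBOX_LAYER_SPLIT_ACT_TYPE")` as `override_value`. Keeping the table a pure function of the architecture number is what makes it testable without a GPU.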

Notes

A real 16K Qwen3.6-27B Q4 HIP layer-split A/B run on dual Pro VII reduced request-time peak VRAM from 10199/11224 MiB to 9896/10918 MiB, a saving of 303 MiB and 306 MiB on the two cards respectively. This is close to the ~313 MiB expected from the sizing formula at 16K.
Runtime smoke passed on dual Pro VII / gfx906 HIP target layer split with the default F16 activation path for Qwen3.6-27B Q4, Laguna-XS.2 Q4, and Gemma4 31B Q3.
This PR is intended as a follow-up to #306 (refactor(server): share target layer-split runtime helpers) and #309 (feat(dflash): reduce feature mirror memory with dtype policy). #306 provides the shared layer-split runtime structure, while #309 covers DFlash draft feature mirror storage.

cubic-dev-ai

4 issues found across 30 files

Apply the updated Luce-Org#310 backend activation precision policy changes on top of the existing auto-integration conflict resolution, preserving the Qwen35 MoE persistent logits graph resolution already carried in the stack.

Record the 2026-05-29 14:03 cron pass, confirming Luce-Org#309/Luce-Org#310 and all other current included PR heads remain ancestors of easel/auto-integration. Re-probe the remaining old non-draft PRs from the current integration tip and record their conflict sets and retained worktrees.

weicj added 3 commits May 29, 2026 02:07

fix(server): enable sampling for target layer split

a9aedf7

feat(server): add Laguna target-layer-split adapter

53dd168

refactor(server): share target layer-split runtime helpers

988fc93

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026

docs: refresh auto-integration manifest

434389f

Record PR Luce-Org#310 integration, refreshed PR classification, retained probe worktrees, and validation outcomes for the unattended integration run.

cubic-dev-ai Bot reviewed May 29, 2026

Comment thread server/src/qwen35/qwen35_layer_split_adapter.cpp Outdated

Comment thread server/src/gemma4/gemma4_layer_split_adapter.cpp Outdated

Comment thread server/src/laguna/laguna_target_loader.cpp

Comment thread server/src/laguna/laguna_layer_split_adapter.cpp Outdated

feat(server): reduce layer-split activation memory with backend precision policy

bf9f4b5

weicj force-pushed the feat-backend-activation-precision-policy-after-306 branch from e73c2e3 to bf9f4b5 Compare May 29, 2026 17:47

weicj wants to merge 4 commits into Luce-Org:main from weicj:feat-backend-activation-precision-policy-after-306

1 participant